Pornography Web Pages Classification with Textual Content Analysis Using Entropy Term Weighting Scheme for Small Class Dataset

Authors

  • Lee Zhi Sam
  • Mohd Aizaini bin Maarof
  • Ali Selamat
  • Siti Mariyam Shamsuddin
Abstract

The rapid growth of the internet has made objectionable web content, such as pornography and violence, easily accessible to web users, especially children and teenagers. Because popular web filtering techniques such as Uniform Resource Locator (URL) blocking and Platform for Internet Content Selection (PICS) checking are of limited use against today's dynamic web content, content-based analysis techniques with an effective model are highly desirable. In this paper, we propose a textual content analysis model that uses an entropy term weighting scheme to classify pornography and sex education web pages. We compare the entropy scheme with two other common term weighting schemes, TFIDF and Glasgow. The schemes are examined extensively with an artificial neural network on a small class dataset. We found that our proposed model achieves better performance in terms of accuracy, convergence speed, and stability.
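The abstract does not state the exact entropy formula the authors use. As an illustration only, the sketch below implements the widely used log-entropy formulation, in which a term's local weight log(1 + tf) is scaled by a global weight 1 + (sum over documents of p log p) / log N, where p is the share of the term's total occurrences that fall in a given document and N is the number of documents. The function name `entropy_weights` and the toy tokens are assumptions made for this example, not taken from the paper.

```python
import math
from collections import Counter


def entropy_weights(docs):
    """Log-entropy term weights (assumed formulation, not the paper's own).

    docs: list of token lists.
    Returns one {term: weight} dict per document.
    """
    n_docs = len(docs)
    tf = [Counter(doc) for doc in docs]  # per-document term frequencies
    gf = Counter()                       # corpus-wide term frequencies
    for counts in tf:
        gf.update(counts)

    global_w = {}
    for term, total in gf.items():
        if n_docs < 2:
            # log(1) == 0, so the entropy term is undefined; fall back to 1.
            global_w[term] = 1.0
            continue
        ent = 0.0
        for counts in tf:
            if counts[term]:
                p = counts[term] / total
                ent += p * math.log(p)
        # 0 for a term spread evenly over all documents,
        # 1 for a term concentrated in a single document.
        global_w[term] = 1.0 + ent / math.log(n_docs)

    return [
        {t: math.log(1 + c) * global_w[t] for t, c in counts.items()}
        for counts in tf
    ]


# Toy corpus: "health" occurs equally in both documents, "adult" in only one.
docs = [["health", "education", "clinic"], ["health", "adult", "adult"]]
weights = entropy_weights(docs)
```

Under this formulation, a term that is spread evenly across all documents receives a weight of zero, while a term concentrated in one document keeps its full local weight. The resulting vectors could then serve as inputs to a classifier such as a neural network.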

Similar resources

InFeRno - An Intelligent Framework for Recognizing Pornographic Web Pages

In this work we present InFeRno, an intelligent web pornography elimination system, classifying web pages based solely on their visual content. The main characteristics of our system include: (i) a powerful vector space with a small but sufficient number of features that manage to improve the discriminative ability of the SVM classifier; (ii) an extra class (bikini) that strengthens the perform...

Web page feature selection and classification using neural networks

Automatic categorization is the only viable method to deal with the scaling problem of the World Wide Web (WWW). In this paper, we propose a news web page classification method (WPCM). The WPCM uses a neural network with inputs obtained by both the principal components and class profile-based features. Each news web page is represented by the term-weighting scheme. As the number of unique words...

Extracting Information from Web Content and Structure

The Web is a vast data repository. By mining this data efficiently, we can gain valuable knowledge. Unfortunately, in addition to useful content, there are also many Web documents considered harmful (e.g. pornography, terrorism, illegal drugs). Web mining, which includes three main areas (content, structure, and usage mining), may help us detect and eliminate these sites. In this paper, we conce...

A Novel Scheme for Improving Accuracy of KNN Classification Algorithm Based on the New Weighting Technique and Stepwise Feature Selection

The K nearest neighbor algorithm is one of the most frequently used techniques in data mining for its integrity and performance. Though the KNN algorithm is highly effective in many cases, it has some essential deficiencies, which affect the classification accuracy of the algorithm. First, the effectiveness of the algorithm is affected by redundant and irrelevant features. Furthermore, this algori...

Improved Web Page Identification Method Using Neural Networks

In this paper, an improved web page classification method (IWPCM) using neural networks to identify the illicit contents of web pages is proposed. The proposed IWPCM approach is based on the improvement of feature selection of the web pages using class based feature vectors (CPBF). The CPBF feature selection approach has been calculated by considering the important term's weight for illicit web...

Publication date: 2007